Summary. Techniques for data mining, latent semantic analysis, contextual search of databases,
etc. were developed long ago by computer scientists working on information retrieval
(IR). Experimental scientists from all disciplines, who have to analyse large collections of raw
experimental data (astronomical, physical, biological, etc.), have developed powerful methods
for their statistical analysis and for clustering, categorising, and classifying objects. Finally,
physicists have developed a theory of quantum measurement, unifying the logical, algebraic,
and probabilistic aspects of queries into a single formalism.

The purpose of this paper is twofold: first, to show that, when formulated at an abstract
level, problems from IR, from statistical data analysis, and from physical measurement
theories are very similar and hence can profitably be cross-fertilised; and, secondly, to propose
a novel method of fuzzy hierarchical clustering, termed semantic distillation and strongly
inspired by the theory of quantum measurement, that we developed to analyse raw data coming
from various types of experiments on DNA arrays. We illustrate the method by analysing DNA
array experiments and clustering the genes of the array according to their specificity.

Keywords: Quantum information retrieval, semantic distillation, DNA microarray, quantum
and fuzzy logic



1 Introduction 

Sequencing the genome was a culminating point of the analytic approach to
Biology. Now starts the era of the synthetic approach in Systems Biology, where
interactions among genes induce their differential expression, which leads to the
functional specificity of cells and to the coherent organisation of cells into tissues,
organs, and finally organisms.

However, we are still far from a complete explanatory theory of living matter. It is
therefore important to establish a precise and quantitative phenomenology before
attempting to formulate a theory. The contribution of this paper is to provide the reader
with a novel algorithmic method, termed semantic distillation, to analyse DNA array
experiments (where genes are hybridised with various cell lines corresponding
to various tissues or specific individuals) by determining the degree of specificity of
every gene to the particular context. The method provides experimental biologists
with lists of candidate genes (ordered by their degree of specificity) for every
biological context, clinicians with improved tools for diagnosis, pharmacologists with
patient-tailored therapies, etc.

In the sequel we present the method split into several algorithmic tasks, thought of as
subroutines of the general algorithm. It is worth noting that the method, although it can
profitably exploit information stored in existing databases, does not rely on it;
its rationale is to help analyse raw experimental data even in the absence
of any previous knowledge.

The main idea of the method is summarised as follows. Experimental information
held on the objects of the system undergoes a sequence of processing steps; each
step is performed on a different representation of the information. These different
representation spaces and the corresponding information processing act as successive
filters, revealing at the end the most pertinent and significant part of the information,
hence the name "semantic distillation".

At the first stage, raw experimental data, containing all available information, are
represented in an abstract Hilbert space, the space of concepts (reminiscent of the
space of pure states in Quantum Mechanics), endowing the set of objects with a
metric space structure that is exploited to quantify the interactions among objects
and encode them into a weighted graph whose vertices are the objects and whose
edge weights are the object interactions.

Now objects (genes) are parts of an organised system (cell, tissue, organism).
Therefore their mutual interactions are not just independent random variables; they
are interconnected through precise, although certainly very complicated and mostly
unknown, relationships. We seek to reveal these hidden and unknown interactions among
genes. This is achieved by trading the weighted graph representation for a
low-dimensional representation and using spectral properties of the weighted Laplacian
on the graph to grasp the essential interactions.

The following step consists of a fuzzy divisive clustering of the objects into two
subsets by exploiting the previous low-dimensional representation. This procedure
assigns to each object a fuzzy membership relative to the characters of the two subsets.
Fuzziness is, as a matter of fact, a distinctive property of experimental biological data,
reflecting our incomplete knowledge of fundamental biological processes.

Up to this step, our method is a sequence of known algorithms that have been
previously used separately in the literature in various contexts. The novelty of our
method lies in the following steps. The previous fuzzy clustering reduced the
indeterminacy of the system. This information is fed back to the system to perform a
projection to a proper Hilbert subspace. In that way, the information content of the
dataset is modified by the information gained by the previous observations. After
this feeding back, the three previous steps are repeated but now referring to a Hilbert
space of lower dimension. Therefore our method is not a mere fuzzy clustering
algorithm but a genuine non-classical interactive information retrieval procedure where
previous observations alter the informational content of the system, reminiscent of
the measurement procedure in Quantum Mechanics.

2 A Hilbert space formulation

2.1 Mathematical form of the dataset 

Let $B$ be a finite set of documents (or objects, or books) and $A$ a finite set of attributes
(or contexts, or keywords). The dataset is a $|B| \times |A|$ matrix $X = (x_{ba})_{b \in B, a \in A}$ of real
or complex elements, where $|\cdot|$ represents cardinality. Equivalent ways of representing
the dataset are

• a collection of $|B|$ row vectors $x_b = (x_{b1}, \ldots, x_{b|A|})$, $b \in B$, of $\mathbb{R}^{|A|}$ (or $\mathbb{C}^{|A|}$),

• a collection of $|A|$ column vectors $x^a = (x_{1a}, \ldots, x_{|B|a})$, $a \in A$, of $\mathbb{R}^{|B|}$ (or $\mathbb{C}^{|B|}$).

Example 1. In the experiments we analysed, $B$ is a set of 12000 human genes and
$A$ a set of 12 tissular contexts. The matrix elements $x_{ba}$ are real numbers encoding
luminescence intensities (or their logarithms) of the DNA array, ultimately representing
the level of expression of gene $b$ in context $a$.

Example 2. Let $B$ be a set of books in a library and $A$ a set of bibliographic keywords.
The matrix elements $x_{ba}$ can be $\{0, 1\}$-valued: if the term $a$ is present in the book $b$
then $x_{ba} = 1$, else $x_{ba} = 0$. A variant of this example is when the $x_{ba}$ are integer valued:
if the term $a$ appears $k$ times in document $b$ then $x_{ba} = k$.

Example 3. Let $B$ be a set of students and $A$ a set of examination papers they took. The matrix
elements $x_{ba}$ are real valued; $x_{ba}$ is the mark the student $b$ got in paper $a$.

The previous examples demonstrate the versatility of the method: keeping the
formalism at an abstract level allows it to be applied without any change to very
different situations. Note also that the assignment as set of documents or set of
attributes is a matter of point of view; for instance, Example 3 as it stands is convenient
for evaluating students. Interchanging the roles of the sets $A$ and $B$ renders it adapted to
the evaluation of teaching. As a rule of thumb, in biological applications, $|A| \ll |B|$.

2.2 The space of concepts 

For $A$ and $B$ as in the previous subsection, we define the space of concepts, $\mathcal{H}_A$, as
the real or complex free vector space over $A$, i.e. elements of $A$ serve as indices of
an orthonormal basis of $\mathcal{H}_A$. Therefore, the complete dataset $X$ can be represented
as the collection of $|B|$ vectors $|\Xi_b\rangle = \sum_{a \in A} x_{ba} |a\rangle \in \mathcal{H}_A$, with $b \in B$, where
$|a\rangle$ represents the element of the orthonormal basis of the free vector space
corresponding to the attribute $a$. We use here Dirac's notation to represent vectors, linear
forms and projectors on this space (see any book on quantum mechanics or [26]
for a freely accessible document and [29] for the use of this notation in information
retrieval). The vector $|\Xi_b\rangle$ contains all available experimental information on
document $b$ in the various cellular contexts indexed by the attributes $a$; it can be thought of as
a convenient bookkeeping device for the data $(x_{ba})_{a \in A}$, in the same way that a generating
function contains all the information on a sequence as a formal power series.

The vector space is equipped with a scalar product defined for every two vectors
$|\psi\rangle = \sum_{a \in A} \psi_a |a\rangle$ and $|\psi'\rangle = \sum_{a \in A} \psi'_a |a\rangle$ by $\langle \psi | \psi' \rangle = \sum_{a \in A} \overline{\psi}_a \psi'_a$, where $\overline{\psi}_a$
denotes the complex conjugate of $\psi_a$ (it coincides with $\psi_a$ if it is real). Equipped with
this scalar product, the vector space $\mathcal{H}_A$ becomes a real or complex $|A|$-dimensional
Hilbert space. The scalar product induces a Hilbert norm on the space, denoted by
$\|\cdot\|$. In the sequel we also introduce rays on the Hilbert space, i.e. normalised vectors.
Since the dataset $X$ does not in principle satisfy any particular numerical constraints,
rays are constructed by dividing vectors by their norms. We use the symbol
$|\xi_b\rangle = |\Xi_b\rangle / \| |\Xi_b\rangle \|$ to denote the ray associated with the vector $|\Xi_b\rangle$.

The Hilbert space structure on $\mathcal{H}_A$ allows a natural geometrisation of the space
of documents by equipping it with a pseudo-distance³ $d : B \times B \to \mathbb{R}_+$ defined by
$d(b,b') = \| |\xi_b\rangle - |\xi_{b'}\rangle \|$. What is important here is not the precise form of the
pseudo-metric structure of $(B, d)$; several other pseudo-distances can be introduced,
not necessarily compatible with the scalar product. In this paper we stick however to
the previous pseudo-distance, postponing to a later publication explanations about
the significance of other pseudo-distances.
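As an illustration of the construction above, the following minimal sketch computes rays and the pseudo-distance for a real-valued dataset. The function names `normalise` and `pseudo_distance` and the toy matrix `X` are hypothetical, chosen for this example only; complex data would require conjugation in the norm.

```python
import math

def normalise(vec):
    """Return the ray (unit vector) associated with a non-zero real vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("the zero vector has no associated ray")
    return [x / norm for x in vec]

def pseudo_distance(row_b, row_c):
    """d(b, b') = || xi_b - xi_b' ||, computed on the normalised rows."""
    xi_b, xi_c = normalise(row_b), normalise(row_c)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(xi_b, xi_c)))

# Hypothetical toy dataset: 3 documents (rows) x 2 attributes (columns).
X = [
    [1.0, 0.0],
    [2.0, 0.0],  # proportional to row 0, hence the same ray
    [0.0, 3.0],  # orthogonal to row 0
]

d01 = pseudo_distance(X[0], X[1])  # vanishes although the documents differ
d02 = pseudo_distance(X[0], X[2])  # orthogonal rays are at distance sqrt(2)
```

Rows 0 and 1 are distinct documents at distance zero, which is why $d$ is only a pseudo-distance on $B$.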

As is the case in Quantum Mechanics, the Hilbert space description incorporates
into a unified algebraic framework all logical and probabilistic information held by
the dataset. An enquiry of the type "does the system possess feature $F$" is encoded
into a projector $P_F$ acting on the Hilbert space. The subspace associated with the
projector $P_F$ is interpreted as the set of documents retrieved by asking the question about
the feature $F$. Now all experimental information held by the dataset is encoded into
the state of the system, represented by a density matrix $\rho$ (i.e. a self-adjoint, positive,
trace class operator acting on $\mathcal{H}_A$ and having unit trace). Retrieved documents possess
the feature $F$ with probability $\mathrm{tr}(\rho P_F)$. Thus the algebraic description incorporates
logical information on the documents retrieved as relevant to a given feature and
assigns them a probability determined by the state defined by the experiment. For
example, the probability that a gene $b$ is relevant to an attribute $a$ is given by the above
formula with $P = |a\rangle\langle a|$ and $\rho = |\xi_b\rangle\langle \xi_b|$, yielding $\mathrm{tr}(\rho P) = |\langle \xi_b | a \rangle|^2$.
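For a pure state, the relevance probability reduces to the squared modulus of the normalised matrix element. A minimal sketch, assuming one data row per document; the function name `relevance_probabilities` and the sample row are hypothetical:

```python
def relevance_probabilities(row):
    """Return |<xi_b|a>|^2 for every attribute a, given one data row x_b.

    Since |xi_b> = |Xi_b> / ||Xi_b||, this equals |x_ba|^2 / sum_a' |x_ba'|^2.
    abs() is used so that complex entries are handled as well.
    """
    norm_sq = sum(abs(x) ** 2 for x in row)
    if norm_sq == 0.0:
        raise ValueError("a zero row defines no state")
    return [abs(x) ** 2 / norm_sq for x in row]

# Hypothetical expression levels of one gene in three contexts.
expression = [3.0, 4.0, 0.0]
probs = relevance_probabilities(expression)
```

Because the basis $\{|a\rangle\}$ is orthonormal and $|\xi_b\rangle$ has unit norm, these probabilities sum to one over the attributes.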

3 A weighted graph with augmented vertex set

The careful reader has certainly already noted that in the above description the vectors
$|\xi_b\rangle$, encoding the information about document $b$, and the basis vectors $|a\rangle$, associated
with attribute $a$, all belong to the same Hilbert space $\mathcal{H}_A$. Therefore, although
initially the sets $A$ and $B$ are disjoint since they have distinct elements, when passing
to the Hilbert space representation, the vectors $|\xi_b\rangle$ and $|a\rangle$ have very similar roles in
representing indistinguishably objects or attributes as vectors of $\mathcal{H}_A$. In the sequel,
we introduce the set $V$ (or more precisely $V_A$, to remove any ambiguity) as the set
$V_A = A \cup B$. Thus, for any $v \in V_A$,

$$|\Xi_v\rangle = \begin{cases} |a\rangle & \text{if } v = a \in A, \\ \sum_{a \in A} x_{ba} |a\rangle & \text{if } v = b \in B. \end{cases}$$

The new vectors $|\Xi_a\rangle = |a\rangle$ are included as specificity witnesses in the dataset. Note
that since these new vectors are also elements of the same Hilbert space, the
pseudo-distance $d$ naturally extends to $V_A$.

³ It is termed pseudo-distance since it verifies symmetry and the triangle inequality but $d(b,b')$
can vanish even for different $b$ and $b'$. As a matter of fact, $d$ is a distance on the projective
Hilbert space.

Suppose now that a similarity function $\sigma : V_A \times V_A \to [0,1]$ is defined. For
the sake of definiteness, the reader can think of $\sigma$ as being given, for example⁴,
by $\sigma(v,v') = \sqrt{1 - \frac{1}{2} d(v,v')^2}$; the results we quote in section 5 are obtained with a
slight modification of this similarity function. However, again, the precise form of
the similarity function is irrelevant in the abstract setting serving as foundation of
the method. Several other similarity functions have been used, for example
$\sigma(v,v') = \exp(-\| |\Xi_v\rangle - |\Xi_{v'}\rangle \|^2 / \tau)$ with $\tau$ a positive constant, or some others, in
particular functions taking the value 0 even for some vertices corresponding to
non-orthogonal rays, but the explanation of their significance is postponed to a subsequent
publication.

A weighted graph is now constructed with vertex set $V_A$. Weights are assigned
to the edges of the complete graph over $V_A$, the weights being expressible in terms
of the similarity function $\sigma$. Again, the precise expression is irrelevant for the
exposition of the method. For the sake of concreteness, the reader can suppose that
the weights $W_{vv'}$ are given by $W_{vv'} = \sigma(v,v')$. The pair $(V_A, W)$, with $W$ being the
symmetric matrix $W = (W_{vv'})_{v,v' \in V_A}$, denotes the weighted graph.
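The construction of the weight matrix on the augmented vertex set can be sketched as follows. This is a minimal illustration, not the implementation used for the results of section 5: it assumes real non-negative data, reads the example similarity as $\sigma = \sqrt{1 - d^2/2}$, and lists the specificity witnesses before the documents; the names `ray`, `similarity` and `weight_matrix` are hypothetical.

```python
import math

def ray(vec):
    """Normalise a non-zero real vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity(u, v):
    """Assumed example similarity sqrt(1 - d^2 / 2) between two rays u and v."""
    d_sq = sum((p - q) ** 2 for p, q in zip(u, v))
    return math.sqrt(max(0.0, 1.0 - 0.5 * d_sq))  # clamp rounding noise at 0

def weight_matrix(X):
    """Weights W[v][v'] = sigma(v, v') on V_A = A u B (witnesses first)."""
    n_attr = len(X[0])
    # Specificity witnesses: one basis vector |a> per attribute.
    witnesses = [[1.0 if i == a else 0.0 for i in range(n_attr)]
                 for a in range(n_attr)]
    rays = witnesses + [ray(row) for row in X]
    return [[similarity(u, v) for v in rays] for u in rays]

# Hypothetical dataset: 2 documents x 2 attributes, so |V_A| = 4.
X = [[3.0, 4.0], [1.0, 0.0]]
W = weight_matrix(X)
```

With this vertex ordering, `W[0][3]` compares attribute 0 with the second document, whose ray coincides with the witness, so the weight is maximal; orthogonal witnesses get weight zero. The diagonal entries are equal to 1 but cancel in the Laplacian $D - W$.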

At this level of the description we follow standard techniques for reducing
the dimensionality of the data by optimal representation of the graph in low-dimensional
Euclidean spaces spanned by eigenvectors of the Laplacian. Such methods have been
used by several authors [4, 24]. Here we give only the basic definitions and main
results of this method. The interested reader may consult standard textbooks like
[8, 10, 15] for a general exposition of the method.

Definition 1. A map $r : V_A \to \mathbb{R}^\nu$ is called a $\nu$-dimensional representation of the
graph. The representation is always supposed non-trivial (i.e. $r \neq 0$) and balanced
(i.e. $\sum_{v \in V_A} r(v) = 0$).

From the weights matrix $W$ we construct the weighted Laplacian matrix $\Delta = D - W$,
where the matrix elements $D_{vv'}$ are 0 if $v \neq v'$ and equal to $\sum_{v'' \in V_A} W_{vv''}$ if $v = v'$. More
precisely, we denote by $\Delta(V_A)$ this weighted Laplacian to indicate that it is defined
on the vertex set $V_A$. This precision will be necessary in the next section specifying
the semantic distillation algorithm, where the vertex set will be recursively modified
at each step. The weighted energy of the representation is given by

$$\mathcal{E}_W(r) = \sum_{v,v' \in V_A} W_{vv'} \|r(v) - r(v')\|^2,$$

where in this formula $\|\cdot\|$ denotes the Euclidean norm of $\mathbb{R}^\nu$.

⁴ This function is well adapted to datasets $X = (x_{ba})$ with $x_{ba} \in \mathbb{R}_+$; for more general
datasets, the factor 1/2 must be changed to 1/4.

Theorem 1. Let $N = |V_A|$ and $\{\lambda_1, \ldots, \lambda_N\}$ be the spectrum of $\Delta$, ordered as
$\lambda_1 \le \lambda_2 \le \cdots \le \lambda_N$. Suppose that $\lambda_2 > 0$. Then $\inf_r \mathcal{E}_W(r) = \sum_{j=2}^{\nu+1} \lambda_j$, where the infimum is over
all $\nu$-dimensional non-trivial balanced representations of the graph.

Remark 1. If $u^1, \ldots, u^N$ are the eigenvectors of $\Delta$ corresponding to the eigenvalues
$\lambda_1, \ldots, \lambda_N$ ordered as above, then $u^2$ is the best one-dimensional, $[u^2, u^3]$ the best
two-dimensional, etc., $[u^2, \ldots, u^{\nu+1}]$ the best $\nu$-dimensional representation of the
graph $(V_A, W)$.
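The link between the energy and the Laplacian spectrum can be checked by hand on a very small graph. The sketch below uses a hypothetical three-vertex path graph, whose Laplacian has spectrum $\{0, 1, 3\}$, and makes two assumptions that the text does not spell out: the representation has unit norm, and the energy is summed over unordered pairs of vertices (summing over ordered pairs doubles it). All function names are illustrative.

```python
def laplacian(W):
    """Weighted Laplacian D - W, with D the diagonal matrix of row sums of W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def energy(W, r):
    """Energy of a 1-D representation r, summed over unordered vertex pairs."""
    n = len(W)
    return sum(W[i][j] * (r[i] - r[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def quadratic_form(L, r):
    """Evaluate r^T L r for a square matrix L and a vector r."""
    n = len(L)
    return sum(r[i] * L[i][j] * r[j] for i in range(n) for j in range(n))

# Path graph 0 - 1 - 2 with unit weights.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
L = laplacian(W)

# Unit-norm eigenvector of L for the eigenvalue 1; it is balanced (sums to 0).
s = 2.0 ** -0.5
u2 = [s, 0.0, -s]

e = energy(W, u2)
q = quadratic_form(L, u2)
```

Both `e` and `q` equal the second eigenvalue, as expected from the identity $\sum_{\{v,v'\}} W_{vv'} (r(v) - r(v'))^2 = r^{T} \Delta r$ for one-dimensional $r$.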

4 Fuzzy semantic clustering and distillation 

The algorithm of semantic distillation is a recursive divisive fuzzy clustering
followed by a projection on a Hilbert subspace and a thinning of the graph. It starts
with the Hilbert space $\mathcal{H}_A$ and the graph with vertex set $V_A$ and constructs a
sequence of Hilbert subspaces and subgraphs indexed by the words $\kappa$ of finite length
on a two-letter alphabet. This set is isomorphic to a subset of the rooted binary tree.
If $\kappa$ is the root, then define $M_\kappa = A$. Otherwise, $M_\kappa$ will be a proper subset of $A$, i.e.
$\emptyset \subset M_\kappa \subset A$, indexed by $\kappa$. When $|M_\kappa| = 1$, the corresponding $\kappa$ is a leaf of the
binary tree. The algorithm stops when all indices correspond to leaves.

More precisely, let $\mathbb{K} = \{1,2\}$, $\mathbb{K}^0 = \{\kappa : \kappa = ()\}$, and for integers $n \ge 1$ let
$\mathbb{K}^n = \{\kappa : \kappa = \kappa_1 \cdots \kappa_n; \kappa_i \in \mathbb{K}\}$. Finally let $\mathbb{K}^* = \bigcup_{n \ge 0} \mathbb{K}^n$ denote the set of words on
two letters of arbitrary finite length, including the empty sequence, denoted by $()$, of zero
length, which coincides with the root of the tree. If $\kappa = \kappa_1 \cdots \kappa_n$ is a word of $n$ letters
and $k \in \mathbb{K}$, we denote by $\kappa k$ the concatenation, i.e. the word of $n + 1$ letters $\kappa_1 \cdots \kappa_n k$.

We start from the empty set Leaves $= \{\}$, the empty sequence $\kappa = ()$, the
current attributes set $M_\kappa = M_{()} = A$ and the current tree Tree $= \{\kappa\}$. We denote
$V_\kappa = B \cup M_\kappa$. We need further a fuzzy membership function $m : V_\kappa \times \mathbb{K} \to [0, 1]$. The fuzzy
clustering algorithm is succinctly described as Algorithm 1 below.

Data: $\kappa$, $M_\kappa$, $r$, objective function $F$
Result: Two sets $M_{\kappa 1}$ and $M_{\kappa 2}$ and the fuzzy membership $m(v,k)$ for $v \in V_\kappa$ in the
        clusters $M_{\kappa 1}$ and $M_{\kappa 2}$
if $|M_\kappa| > 1$ then
    assign $(v_1, v_2) \leftarrow \mathrm{argmax}\{\|r(v) - r(v')\|, v, v' \in V_\kappa\}$;
    assign $r(v_1)$ and $r(v_2)$ as centroids for the two candidate finer clusters $M_{\kappa 1}$ and $M_{\kappa 2}$;
    use standard 2-means fuzzy clustering algorithm to minimise objective function $F$
        under the constraint $\sum_{k=1}^{2} m(v,k) = 1$, for all $v \in M_\kappa$;
    assign $M_{\kappa 1} \leftarrow \{v \in M_\kappa : m(v,1) > m(v,2)\}$;
    assign $M_{\kappa 2} \leftarrow M_\kappa \setminus M_{\kappa 1}$;
end

Algorithm 1: FuzzyClustering
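The clustering step can be sketched as follows for a one-dimensional representation. The objective function $F$ is not specified above, so this sketch assumes the classical fuzzy c-means objective with fuzzifier 2, i.e. $F = \sum_v \sum_k m(v,k)^2 |r(v) - c_k|^2$; the function names and the sample points are hypothetical.

```python
def memberships_for(points, centroids, eps=1e-12):
    """Memberships proportional to inverse squared distances (fuzzifier 2)."""
    result = []
    for x in points:
        # eps avoids a division by zero when a point sits on a centroid.
        inv = [1.0 / ((x - c) ** 2 + eps) for c in centroids]
        total = sum(inv)
        result.append(tuple(w / total for w in inv))
    return result

def fuzzy_two_means(points, iterations=50):
    """Fuzzy 2-means on 1-D points; returns (centroids, memberships)."""
    # In one dimension the two most distant points are the minimum and maximum.
    centroids = [min(points), max(points)]
    for _ in range(iterations):
        m = memberships_for(points, centroids)
        # Centroid update: mean of the points weighted by squared memberships.
        centroids = [
            sum(mi[k] ** 2 * x for mi, x in zip(m, points))
            / sum(mi[k] ** 2 for mi in m)
            for k in range(2)
        ]
    return centroids, memberships_for(points, centroids)

# Hypothetical 1-D representation values r(v) of six vertices.
points = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
centroids, memberships = fuzzy_two_means(points)

# Crisp split, as in the last two assignments of Algorithm 1.
cluster_1 = [x for x, m in zip(points, memberships) if m[0] > m[1]]
cluster_2 = [x for x in points if x not in cluster_1]
```

The memberships of each point sum to one by construction, which is the constraint imposed in Algorithm 1; the crisp split only serves to define the two finer sets, while the memberships themselves carry the degrees of specificity.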



Semantic distillation 7 

Note that in the previous construction $M_{\kappa k} \subset M_\kappa$ for every $\kappa$ and every $k \in \mathbb{K}$.
Therefore, the algorithm explores the branches of a tree from the root to the leaves.
Denote by $\pi_\kappa$ the orthogonal projection from $\mathcal{H}_A$ to $\mathcal{H}_{M_\kappa}$. The distillation step is
described by the following Algorithm 2.



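Data: FuzzyClustering
Result: Leaves and sequence of singleton sets $M_\kappa$ for $\kappa \in$ Leaves
Initialisation {
    $\kappa \leftarrow ()$;
    $M_\kappa \leftarrow A$;
    Leaf($\kappa$) $\leftarrow M_\kappa$;
    Leaves $\leftarrow \{\}$;
    Tree $\leftarrow \{\kappa\}$;
    Bookkeeping $\leftarrow \{\kappa\}$;
}
while Bookkeeping $\neq \emptyset$ do
    for $\kappa \in$ Bookkeeping do
        if $|M_\kappa| = 1$ then
            Leaves $\leftarrow$ Leaves $\cup \{\kappa\}$;
            Bookkeeping $\leftarrow$ Bookkeeping $\setminus \{\kappa\}$;
        else
            Use $\pi_\kappa$ to project from $\mathcal{H}_A$ to $\mathcal{H}_{M_\kappa}$;
            Thin the graph: $V_\kappa \leftarrow B \cup M_\kappa$;
            Compute weighted Laplacian $\Delta(V_\kappa)$;
            Diagonalise $\Delta(V_\kappa)$;
            Compute $\nu$-dimensional representation $r$;
            Call FuzzyClustering;
            for $k \in \mathbb{K}$ do
                $\kappa' \leftarrow \kappa k$;
                Leaf($\kappa'$) $\leftarrow M_{\kappa'}$ /* $M_{\kappa'}$ as determined by FuzzyClustering */;
                Tree $\leftarrow$ Tree $\cup \{\kappa'\}$;
                Bookkeeping $\leftarrow$ Bookkeeping $\cup \{\kappa'\}$;
            end
            Bookkeeping $\leftarrow$ Bookkeeping $\setminus \{\kappa\}$ /* $\kappa$ has been split; needed for termination */;
        end
    end
end

Algorithm 2: Distillation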
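The bookkeeping of the distillation loop, i.e. the exploration of the binary tree of words on $\{1, 2\}$, can be sketched independently of the numerical steps. In the sketch below, `split` is a hypothetical callable standing in for the projection, the Laplacian diagonalisation and the call to FuzzyClustering; the toy `split_alphabetically` has no biological meaning and only exercises the recursion.

```python
def distill(attributes, split):
    """Explore the binary tree of attribute subsets down to singleton leaves.

    `split` maps a frozenset with at least two attributes to two non-empty
    disjoint parts whose union is the input. Words on the alphabet {1, 2}
    are represented as tuples; the empty tuple () is the root.
    """
    subsets = {(): frozenset(attributes)}  # M_kappa for every visited word
    leaves = []
    pending = [()]                         # plays the role of "Bookkeeping"
    while pending:
        kappa = pending.pop()              # kappa leaves the bookkeeping set
        m_kappa = subsets[kappa]
        if len(m_kappa) == 1:
            leaves.append(kappa)
            continue
        part_1, part_2 = split(m_kappa)
        for letter, part in ((1, part_1), (2, part_2)):
            child = kappa + (letter,)      # concatenation kappa k
            subsets[child] = frozenset(part)
            pending.append(child)
    return subsets, sorted(leaves)

def split_alphabetically(m_kappa):
    """Toy stand-in for the clustering step: halve the sorted attribute list."""
    ordered = sorted(m_kappa)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]

subsets, leaves = distill(["brain", "heart", "liver"], split_alphabetically)
```

Each word is removed from the pending list as soon as it is processed, so the loop terminates once every branch has reached a singleton attribute set.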



5 Illustration of the method, robustness and complexity issues 

We tested the method on a dataset from a DNA array experiment published in
[35], with the set $A$ of attributes corresponding to 12 cell lines (bone marrow, liver,
heart, spleen, lung, kidney, skeletal muscle, spinal cord, thymus, brain, prostate,
pancreas) and the set $B$ of documents corresponding to 12000 human genes. To
illustrate the method we present here only an example of the type of results we obtain
by our method for the simplest case of a one-dimensional representation of the graph.
The complete lists of specificity degrees for the various genes (including their
UniGene identifiers) for various dimensions are provided as supplemental material (at
the home page of the first author).

Note that for a one-dimensional representation, ordering by the magnitude of the
eigenvector components is equivalent to a relabelling of genes. Figure 1 represents,
within the previously mentioned relabelling, the levels of expression for clustered
genes. The same procedure has been applied for higher-dimensional representations
of the graph (i.e. $\nu > 1$). These results are not presented here; they marginally
improved some specifications and helped us remove apparent degeneracy in some
cases. Finally, in Table 1, we give an example of the annotation provided by the
database UniGene for the genes classified as specific to the skeletal muscle cell line by
our method.

[Figure 1: two panels, titled "liver" and "skeletal muscle".]

Fig. 1. For every singleton cluster, i.e. tissular context $\kappa \in$ Leaves (we present solely the cases
$M_\kappa = \{\text{liver}\}$ and $M_\kappa = \{\text{skeletal muscle}\}$ in this example), the horizontal axis contains the
set $B$ of genes relabelled according to their decreasing (resp. increasing) fuzzy membership to
$M_\kappa$. The vertical axis represents the experimentally measured level of expression for those genes.

We observe that the majority of genes classified as most specific by our method
are in fact annotated as specific in the database. To underline the power of our
method, note that the UniGene annotation for the ATPase gene is "cardiac muscle".
Our method determines it as most specific to "skeletal muscle". We checked
the experimental data we worked on and realised that this gene is, as a matter of fact,
5 times more expressed in the skeletal muscle context than in the cardiac muscle.
Therefore, our method correctly determines this gene as skeletal-muscle-specific.

In summary, our method is an automatic and algorithmic method of analysis
of raw experimental data; it can be applied to any experiment of similar type,
independently of any previous knowledge included in genomic databases, to provide
biologists with a powerful tool of analysis. In particular, since most of the genes
are not yet annotated in the existing databases, the method provides biologists with
candidate genes for every particular context for further investigation. Moreover, the
genetic character of documents and attributes is irrelevant; the same method
can be applied to any other dataset of similar structure, whether it concerns linguistic,
genetic, or image data.

Table 1. Annotation of the genes closest (within the relabelling induced by $u^2$) to the
specificity witness "skeletal muscle". Genes are separated by the - symbol.

ATPase, Ca++ transporting, cardiac muscle, fast twitch 1, calcium signaling pathway - Troponin I type 2; skeletal,
fast - Myosin, light chain 1, alkali; skeletal, fast - Ryanodine receptor 1; skeletal; calcium signaling pathway
- Fructose-1,6-bisphosphatase 2, glycolysis / gluconeogenesis - Actinin, alpha 3; focal adhesion - Adenosine
monophosphate deaminase 1 (isoform M); purine metabolism - Troponin C type 2; fast; calcium signaling pathway -
Carbonic anhydrase III, muscle specific; nitrogen metabolism - Nebulin - Troponin I type 1; skeletal, slow - Myosin,
heavy chain 3, skeletal muscle - Myogenic factor 6, herculin - Myosin binding protein C, fast type - Calcium channel,
voltage-dependent, beta 1 subunit - Metallothionein 1X - Bridging integrator 1 - Bridging integrator 1 - Calpain
3, (p94) - Tropomyosin 3 - Phosphorylase, glycogen; muscle (McArdle syndrome, glycogen storage disease type V);
starch and sucrose metabolism - Myozenin 3 - Myosin binding protein C, slow type - Troponin T type 3; skeletal, fast
- Superoxide dismutase 2; mitochondrial - Nicotinamide N-methyltransferase - Sarcolipin - Interleukin 32 - Sodium
channel, voltage-gated, type IV, alpha subunit - Guanidinoacetate N-methyltransferase; urea cycle and metabolism of
amino groups.

Concerning the algorithmic complexity of the method, the dominant contribution
comes from the diagonalisation of a $|B| \times |B|$ dense real symmetric matrix, requiring
at worst $\mathcal{O}(|B|^3)$ time steps and $\mathcal{O}(|B|^2)$ space. The time complexity can be slightly
reduced, if only low-dimensional (dimension $\nu$) representations are sought, to
$\mathcal{O}(\nu \times |B|^2)$ time steps. Moreover, we tested the method against additive or multiplicative
random perturbations of the experimental data; it proved astonishingly robust.



6 Connections to previous work 

The algorithm of semantic distillation maps the dataset into a graph and uses spectral 
methods and fuzzy clustering to analyse the graph properties. As such, this algorithm 
is inspired by various pre-existing algorithms and borrows several elements from 
them. 

The oldest implicit use of a vector space structure to represent a dataset, and of
spectral methods to analyse it, is certainly "principal components analysis",
introduced in [25]. The method seeks to find directions of maximal variability
in the space corresponding to linear combinations of the underlying vectors. The
major drawbacks of principal components analysis are the assumptions that the dataset
matrix is composed of row vectors that are independent and identically distributed
realisations of the same random vector (hence the covariance matrix whose principal
components are sought can be approximated by the empirical covariance of the
process) and that there exists a linear transformation maximising the variability.

Vector space representations and singular value decomposition, as reviewed in
[5], have been used to retrieve information from digital libraries. Implementations
of these ideas range from the famous PageRank algorithm used by Google (see [18]
and [17] for expository reviews) to whole genome analysis based on latent semantic
indexing [23, 16].

Th. Sierocinski et al.

From the information contained in the dataset $X$, a weighted graph of interactions
among documents is constructed. To palliate the weaknesses of principal component
analysis, reproducing kernel methods can be used. The oldest account of these methods
seems to be [21] and their formulation in the context of Hilbert spaces can be
found in [1]. In [31], analysis of features of a microarray experiment is proposed
based on kernel estimates on a graph. Note however that in that paper, the graph
incorporates extrinsic information coming from participation of genes in specific
pathways as documented in the KEGG database. On the contrary, in the method we are
proposing here, the graph can be constructed in an intrinsic way, even in the absence
of any additional information from existing databases. In [4, 9, 24], kernel methods
and Laplace eigenspace decomposition are used to generalise principal components
analysis to include non-linear interactions among genes. Particular types of kernels,
defined in terms of commuting times for a random walk on the graph, are used in
[13, 20, 30]. All these methods, although this is not always explicitly stated in these
articles, are as a matter of fact very closely related, since the kernels, the weighted graph
Laplacian and the simple random walk on the graph can be described in a unified
formalism [7, 8, 10, 15, 22]. It is worth noting that analysis of the Laplacian of the graph
is used in many different contexts, ranging from biological applications (protein
conformation [32], gene arrays [23]) to web search [3] or image analysis [28].

Fuzzy clustering was introduced in [6]; more recently it was shown [33] to be equivalent
to probabilistic clustering if the objective function is expressed in terms of the Rényi
entropy.

The idea of describing the data in terms of abstract Hilbert spaces has been used 
(in the context of database search) in [2, 12, 14, 29, 34]. 

The semantic distillation algorithm is based on a quantum-inspired subspace
projection, strongly reminiscent of the quantum procedure of measurement. Although
fully implemented on classical computers, it shares with general quantum algorithms
features of non-distributive quantum logic [26, 27]. The semantic approach to Quantum
Mechanics can be found in [27, 11]. It is worth underlining that the full-fledged
fuzzy logic induced by quantum semantics is not equivalent to the standard fuzzy
logic introduced in [36]; it represents a genuine extension of it [11].

7 Perspectives 

Various data sets (not only biological) are currently being semantically distilled and the
method compared with more traditional approaches. Preliminary results obtained so
far seem to confirm the power of the method.
Several directions are being pursued:

• Although the method is quantum-inspired, the fuzzy logic induced is still standard
fuzzy logic. We are currently working on the extension to generalised fuzzy
logic induced by full-fledged quantum semantics.

• The graph analysis we performed provided us with degrees of specificity of every
gene in a particular context. These data can be reincorporated into the graph as
internal degrees of freedom of a multi-layered graph that can be further analysed.

• The connections of the algorithm of semantic distillation with the algorithm of
purification of quantum states [19], introduced in the context of quantum computing,
are currently being explored.